Exploratory Data Analysis
Analyzing the NYC Airbnb dataset for price prediction.
In [1]:
%matplotlib inline
import wandb
import pandas as pd
import matplotlib.pyplot as plt
Load data from W&B
In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)
wandb: Currently logged in as: danieludacity (danieludacity-udacity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
Tracking run with wandb version 0.22.3
Run data is saved locally in
/home/daniel/DeepAI-Learn/Misc/ML_Workflow/build-ml-pipeline-for-short-term-rental-prices/src/eda/wandb/run-20251030_125644-vw18h74f
View project at https://wandb.ai/danieludacity-udacity/nyc_airbnb
In [3]:
df.head()
Out[3]:
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9138664 | Private Lg Room 15 min to Manhattan | 47594947 | Iris | Queens | Sunnyside | 40.74271 | -73.92493 | Private room | 74 | 2 | 6 | 2019-05-26 | 0.13 | 1 | 5 |
| 1 | 31444015 | TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN... | 8523790 | Johlex | Manhattan | Hell's Kitchen | 40.76682 | -73.98878 | Entire home/apt | 170 | 3 | 0 | NaN | NaN | 1 | 188 |
| 2 | 8741020 | Voted #1 Location Quintessential 1BR W Village... | 45854238 | John | Manhattan | West Village | 40.73631 | -74.00611 | Entire home/apt | 245 | 3 | 51 | 2018-09-19 | 1.12 | 1 | 0 |
| 3 | 34602077 | Spacious 1 bedroom apartment 15min from Manhattan | 261055465 | Regan | Queens | Astoria | 40.76424 | -73.92351 | Entire home/apt | 125 | 3 | 1 | 2019-05-24 | 0.65 | 1 | 13 |
| 4 | 23203149 | Big beautiful bedroom in huge Bushwick apartment | 143460 | Megan | Brooklyn | Bushwick | 40.69839 | -73.92044 | Private room | 65 | 2 | 8 | 2019-06-23 | 0.52 | 2 | 8 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              20000 non-null  int64
 1   name                            19993 non-null  object
 2   host_id                         20000 non-null  int64
 3   host_name                       19992 non-null  object
 4   neighbourhood_group             20000 non-null  object
 5   neighbourhood                   20000 non-null  object
 6   latitude                        20000 non-null  float64
 7   longitude                       20000 non-null  float64
 8   room_type                       20000 non-null  object
 9   price                           20000 non-null  int64
 10  minimum_nights                  20000 non-null  int64
 11  number_of_reviews               20000 non-null  int64
 12  last_review                     15877 non-null  object
 13  reviews_per_month               15877 non-null  float64
 14  calculated_host_listings_count  20000 non-null  int64
 15  availability_365                20000 non-null  int64
dtypes: float64(3), int64(7), object(6)
memory usage: 2.4+ MB
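The non-null counts reported by `df.info()` can also be read off directly as per-column missing counts. The sketch below uses a small made-up frame (`toy`, not the real dataset) and also checks whether two columns are missing on the same rows, which the matching counts for `last_review` and `reviews_per_month` suggest but do not prove.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; all values are invented for illustration.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "last_review": ["2019-05-26", None, None, "2019-06-23"],
    "reviews_per_month": [0.13, np.nan, np.nan, 0.52],
    "price": [74, 170, 245, 125],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# Check whether the two review columns are missing on exactly the same rows.
same_rows = bool((toy["last_review"].isna() == toy["reviews_per_month"].isna()).all())
print("review columns missing together:", same_rows)
```

Running the same two steps on `df` would show whether the listings without a `last_review` are the same ones without `reviews_per_month`.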
Generate profile report
In [5]:
import ydata_profiling
profile = ydata_profiling.ProfileReport(df)
profile.to_notebook_iframe()  # render as an iframe because the widget view doesn't work here
Observations
From the profile report and initial inspection:
- `name`, `host_name`, `last_review`, and `reviews_per_month` have missing values
- `last_review` is stored as a string and should be a datetime
- `price` has outliers (very low and very high values)
Based on stakeholder input, a reasonable price range is $10 to $350 per night.
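Before filtering, it can help to see how many listings a given price range would exclude. The following is a minimal sketch on made-up prices (the `prices` series is illustrative, not taken from the dataset); it relies on `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

# Toy prices standing in for df['price']; the values are invented for illustration.
prices = pd.Series([0, 9, 10, 74, 170, 350, 351, 10000], name="price")

min_price, max_price = 10, 350  # bounds from the stakeholder input

# True where the price lies within [min_price, max_price], bounds included.
in_range = prices.between(min_price, max_price)

n_outside = int((~in_range).sum())
share_outside = n_outside / len(prices)
print(f"{n_outside} of {len(prices)} listings ({share_outside:.1%}) fall outside the range")
```

Applied to the real `df['price']`, the same mask gives the number of rows that the cleaning step will drop.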
Analyze price distribution
In [6]:
df['price'].describe()
Out[6]:
count    20000.000000
mean       153.269050
std        243.325609
min          0.000000
25%         69.000000
50%        105.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
In [7]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(df['price'], bins=50)
ax[0].set_title('Price Distribution')
ax[1].boxplot(df['price'])
ax[1].set_title('Price Boxplot')
fig
Out[7]:
Clean the data
In [8]:
# Remove price outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
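The two cleaning steps above could be wrapped in a small function so they can be reused outside the notebook. This is only a sketch: `clean_listings` and the `toy` frame are hypothetical names introduced here, and the default bounds simply mirror the values used in the cell above.

```python
import pandas as pd

def clean_listings(df, min_price=10, max_price=350):
    """Hypothetical helper mirroring the cleaning cell: filter prices, parse dates."""
    # Keep rows whose price lies within [min_price, max_price], bounds included.
    kept = df[df["price"].between(min_price, max_price)].copy()
    # Parse last_review; missing entries become NaT.
    kept["last_review"] = pd.to_datetime(kept["last_review"])
    return kept

# Toy frame with invented values, standing in for the real dataset.
toy = pd.DataFrame({
    "price": [0, 74, 170, 10000],
    "last_review": ["2019-05-26", None, "2018-09-19", "2019-06-23"],
})

cleaned = clean_listings(toy)
print(cleaned)
print(cleaned.dtypes)
```

Because the function works on a copy, the input frame is left unchanged, which makes it easier to compare the data before and after cleaning.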
Verify cleaned data
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              19001 non-null  int64
 1   name                            18994 non-null  object
 2   host_id                         19001 non-null  int64
 3   host_name                       18993 non-null  object
 4   neighbourhood_group             19001 non-null  object
 5   neighbourhood                   19001 non-null  object
 6   latitude                        19001 non-null  float64
 7   longitude                       19001 non-null  float64
 8   room_type                       19001 non-null  object
 9   price                           19001 non-null  int64
 10  minimum_nights                  19001 non-null  int64
 11  number_of_reviews               19001 non-null  int64
 12  last_review                     15243 non-null  datetime64[ns]
 13  reviews_per_month               15243 non-null  float64
 14  calculated_host_listings_count  19001 non-null  int64
 15  availability_365                19001 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(7), object(5)
memory usage: 2.5+ MB
In [10]:
df['price'].describe()
Out[10]:
count    19001.000000
mean       122.340456
std         71.530346
min         10.000000
25%         66.000000
50%        100.000000
75%        160.000000
max        350.000000
Name: price, dtype: float64
In [11]:
plt.hist(df['price'], bins=50)
plt.title('Price Distribution (Cleaned)')
plt.show()
In [ ]:
run.finish()